Data are the core of everything that we do in statistical analysis. Data come in many forms, and I don’t just mean .csv, .xls, .sav, etc. Data can be wide, long, documented, fragmented, messy, and about anything else that you can imagine.
Although data could arguably be more means than end in psychology, the importance of understanding the structure and format of your data cannot overstated. Failure to understand your data could end in improper techniques and flagrantly wrong inferences at worst. This is especially important for longitudinal data. We will discuss many aspects of data handling. One thing to note is that this is just ONE WAY to do it. There are many equivalent.
Why are we thinking about data? Because 80%, maybe more, of your time spent with “analysis” is spent getting data in order and setting up your model of interest.
Aka multivariate vs stacked; person vs person period untidy vs tidy*
Long is what MLM, ggplot2 and tidyverse packages expect whereas SEM and a lot of descriptive are calculated using wide dataframes.

In this figure X, Y, and Z could represent different waves of collection. For each wave we have some value for each of the two people in the dataset. In the long format each person has each wave as a separate row. In the wide each person has their data on a single row.
We will be working with long data for the first half of the class and wide data the second. However, even during the first half we will need to switch back and forth to make sure we can calculate certain values.
The best package to go back and forth between long and wide is the tidyr package, which is part of the tidyverse. Here we will walk through some examples of the primary functions, pivot_wider and pivot_longer
For longitudinal/repeated measures data, each row is an observation. Each person will have multiple rows. You can grab some example data from the class’s github
data <- read.csv("https://raw.githubusercontent.com/josh-jackson/longitudinal-2021/master/example.csv")
example <- data %>%
select(ID, wave, group, DAN)
head(example)
ID wave group DAN
1 6 1 PD 0.1619
2 6 2 PD 0.1677
3 6 3 PD 0.2153
4 29 1 PD 0.1749
5 29 2 PD 0.1356
6 34 1 CTRL 0.1659
The pivot_wider() function takes two arguments: names_from which is the variable whose values will be converted to column names and values_from whose values will be cell values.
wide.ex <- example %>%
pivot_wider(names_from = wave, values_from = DAN)
wide.ex
# A tibble: 91 x 6
ID group `1` `2` `3` `4`
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 6 PD 0.162 0.168 0.215 NA
2 29 PD 0.175 0.136 NA NA
3 34 CTRL 0.166 0.140 NA NA
4 36 CTRL 0.152 0.205 NA NA
5 37 PD 0.219 0.158 0.259 NA
6 48 PD 0.130 0.270 0.248 NA
7 53 CTRL 0.211 0.152 NA NA
8 54 PD 0.220 0.152 0.192 NA
9 58 PD 0.380 0.215 0.204 NA
10 61 PD 0.0818 0.0628 NA NA
# … with 81 more rows
Going back to long:
The pivot_longer function takes three arguments: cols is a list of columns that are to be collapsed. The columns can be referenced by column number or column name. names_to is the name of the new column which will combine all column names. This is up to you to decide what the name is. values_to is the name of the new column which will combine all column values associated with each variable combination.
long.ex <- wide.ex %>%
pivot_longer(cols = '1':'4',
names_to = "wave",
values_to = "DAN")
long.ex
# A tibble: 364 x 4
ID group wave DAN
<int> <chr> <chr> <dbl>
1 6 PD 1 0.162
2 6 PD 2 0.168
3 6 PD 3 0.215
4 6 PD 4 NA
5 29 PD 1 0.175
6 29 PD 2 0.136
7 29 PD 3 NA
8 29 PD 4 NA
9 34 CTRL 1 0.166
10 34 CTRL 2 0.140
# … with 354 more rows
Many times datasets are, for a lack of a better term, messy. We will talk more about the upfront work later to make sure you dont have messy data. However, if you do have messy data there are a number of helpful functions to tidy-up your data.
One common way to represent longitudinal data is to name the variable with a wave signifier.
wide<- tribble(
~ID, ~ext_1, ~ext_2, ~ext_3,
1, 4, 4,4,
2, 6, 5,4,
3, 4,5,6
)
wide
# A tibble: 3 x 4
ID ext_1 ext_2 ext_3
<dbl> <dbl> <dbl> <dbl>
1 1 4 4 4
2 2 6 5 4
3 3 4 5 6
If we went and tried to pivot_longer we’d end up with
wide %>%
pivot_longer(cols = ext_1:ext_3, names_to = "time", values_to = "EXT")
# A tibble: 9 x 3
ID time EXT
<dbl> <chr> <dbl>
1 1 ext_1 4
2 1 ext_2 4
3 1 ext_3 4
4 2 ext_1 6
5 2 ext_2 5
6 2 ext_3 4
7 3 ext_1 4
8 3 ext_2 5
9 3 ext_3 6
The time column is now specific to ext, which is a problem if I have more than one variable that I am pivoting. But, we will end up using wave as our time variable in our model, and time will have to be numeric. So how can we go ahead and separate out the ext part?
One way is to use the separate function
long<- wide %>%
pivot_longer(cols = ext_1:ext_3,
names_to = "time",
values_to = "EXT") %>%
separate(time, into = c("variable", "time"))
long
# A tibble: 9 x 4
ID variable time EXT
<dbl> <chr> <chr> <dbl>
1 1 ext 1 4
2 1 ext 2 4
3 1 ext 3 4
4 2 ext 1 6
5 2 ext 2 5
6 2 ext 3 4
7 3 ext 1 4
8 3 ext 2 5
9 3 ext 3 6
In terms of setting up your data, it is often helpful to include markers that separate parts of the variable eg "_" or “.” A variable that is ext_1 is easier to separate than ext1.
Note, also that the time column is a character rather than numeric. We need to change this so as to use time continuously in our models. There are a few ways to do it, but this is perhaps the most straightforward.
long$time <- as.numeric(long$time)
long
# A tibble: 9 x 4
ID variable time EXT
<dbl> <chr> <dbl> <dbl>
1 1 ext 1 4
2 1 ext 2 4
3 1 ext 3 4
4 2 ext 1 6
5 2 ext 2 5
6 2 ext 3 4
7 3 ext 1 4
8 3 ext 2 5
9 3 ext 3 6
However, something that is a little more elegant is to do both the separating AND the making into numeric in the original pivot_longer function
names_prefix omits what is in there from the new cell names. Previously we had ext_1, ext_2, etc, which we had to seperate with a different function, but this does it within pivot_longer
wide %>%
pivot_longer(cols = ext_1:ext_3,
names_to = "time",
values_to = "EXT",
names_prefix = "ext_")
# A tibble: 9 x 3
ID time EXT
<dbl> <chr> <dbl>
1 1 1 4
2 1 2 4
3 1 3 4
4 2 1 6
5 2 2 5
6 2 3 4
7 3 1 4
8 3 2 5
9 3 3 6
names_transform does any transformations within the variables. Here instead of a separate call, we can make our variables numeric.
wide %>%
pivot_longer(cols = ext_1:ext_3,
names_to = "time",
values_to = "EXT",
names_prefix = "ext_",
names_transform = list(time = as.numeric))
# A tibble: 9 x 3
ID time EXT
<dbl> <dbl> <dbl>
1 1 1 4
2 1 2 4
3 1 3 4
4 2 1 6
5 2 2 5
6 2 3 4
7 3 1 4
8 3 2 5
9 3 3 6
Another common problem that we often face is the need to unite two variables into one. Enter, the creatively titled unite function. Sometimes this happens when our time metric is entered in seperate columns.
df <- tibble(
ID = c(1, 2, 3),
year = c(2020, 2020, 2020),
month = c(1, 1, 1),
day = c(1, 1, 1),
hour = c(4, 2, 5),
min = c(55, 17, 23))
df
# A tibble: 3 x 6
ID year month day hour min
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2020 1 1 4 55
2 2 2020 1 1 2 17
3 3 2020 1 1 5 23
To combine them into one time metric
df %>%
unite(col = time, 5:6, sep=":", remove =TRUE)
# A tibble: 3 x 5
ID year month day time
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2020 1 1 4:55
2 2 2020 1 1 2:17
3 3 2020 1 1 5:23
A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). These are called POSIXct in R.
today()
[1] "2021-02-04"
now()
[1] "2021-02-04 08:16:12 CST"
Bringing these into R from some outside place (excel, spss) can lead to confusion, as they can be formatted differently
ymd("2017-01-31")
[1] "2017-01-31"
mdy("January 31st, 2017")
[1] "2017-01-31"
dmy("31-Jan-2017")
[1] "2017-01-31"
You can create these relatively straight forwardly…by hand
ymd_hms("2017-01-31 20:11:59")
[1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
[1] "2017-01-31 08:01:00 UTC"
Or you can use existing columns variables. This is where the lubridate package comes in handy
df %>%
mutate(t_1 = make_datetime(year, month, day, hour, min))
# A tibble: 3 x 7
ID year month day hour min t_1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dttm>
1 1 2020 1 1 4 55 2020-01-01 04:55:00
2 2 2020 1 1 2 17 2020-01-01 02:17:00
3 3 2020 1 1 5 23 2020-01-01 05:23:00
Note the t_1 variable is a POSIXct variable type. Once in this format it is much easier to manipulate and work with dates and times.
As with any project, but especially for longitudinal data, one of the most important aspects of data analysis is A. not losing track of what you did and B. being organized. This is much much much harder than said. I find using a combination of 1. rstudio projects 2. git and 3. codebooks are helpful in accomplishing these two goals. We will talk about #1 and #2 but I also encourage you to read about git. These are not the only way to do these sorts of analyses but I feel that exposure to them is helpful, as often in the social sciences these sort of decisions are not discussed.
What these help to do is create a chain of processing where you start with RAW data and end up with the cleaned data. Importantly you can always start over from the raw data. This is important for people wanting to reproduce your findings and or your future self figuring out where a certain variable came from.
We start creating the chain of processing by documenting all of your code, all of it inside. To do so we will be using rmarkdown documents, as the language is easier than LaTeX and more helpful than plaintext.
When I create an rmarkdown document for my own research projects, I always start by setting up 3 components:
Below, we will step through each of these separately, setting ourselves up to (hopefully) flawlessly communicate with R and our data. Note that you do not need to use rmarkdown but I think rmarkdown is much more useful than standard .R syntax.
Packages seems like the most basic step, but it is actually very important. Depending on what gets loaded you might overwrite functions from other packages.(Note: I will often reload or not follow this advice within lectures for didactic reasons, choosing to put library calls above the code)
The second step is a codebook. Arguably, this is the first step because you should create the codebook long before you open R and load your data.
Why a codebook? Well, because you typically have a lot of variables and you will not be able to remember all the details that go into each one of them (rating scale, what the actual item was, was it coded someway, etc). This is especially true now that data are being collected online, which often provides placeholder variable names that then need to be processed somehow. This codebook will serve as a means to document RAW code. It will also allow us to automate some tasks that are somewhat cumbersome, facilitate open data practices, and efficiently see what variables are available. Ultimately, we want to be able to show how we got from the start, with the messy raw data, to our analyses and results at the end? A codebook makes this easier.
To illistrate, we are going to using some data from the German Socioeconomic Panel Study (GSOEP), which is an ongoing Panel Study in Germany. Note that these data are for teaching purposes only, shared under the license for the Comprehensive SOEP teaching dataset, which I, as a contracted SOEP user, can use for teaching purposes. These data represent select cases from the full data set and should not be used for the purpose of publication. The full data are available for free at https://www.diw.de/en/diw_02.c.222829.en/access_and_ordering.html.
For this tutorial, I created the codebook for you, and included what I believe are the core columns you may need. Some of these columns will not be particularly helpful for this dataset. For example, many of you likely work with datasets that have only a single file while others work with datasetsspread across many files (e.g., different waves, different sources). As a result, the “dataset” column of the codebook may only have a single value whereas for others it may have multiple.
Here are my core columns that are based on the original data:
dataset: this column indexes the name of the dataset that you will be pulling the data from. This is important because we will use this info later on (see purrr tutorial) to load and clean specific data files. Even if you don’t have multiple data sets, I believe consistency is more important and suggest using this.
old_name: this column is the name of the variable in the data you are pulling it from. This should be exact. The goal of this column is that it will allow us to select() variables from the original data file and rename them something that is more useful to us. If you have worked with qualtrics (really any data) you know why this is important.
item_text: this column is the original text that participants saw or a description of the item.
scale: this column tells you what the scale of the variable is. Is it a numeric variable, a text variable, etc. This is helpful for knowing the plausible range.
reverse: this column tells you whether items in a scale need to be reverse coded. I recommend coding this as 1 (leave alone) and -1 (reverse) for reasons that will become clear later.
mini: this column represents the minimum value of scales that are numeric. Leave blank otherwise.
maxi: this column represents the maximumv alue of scales that are numeric. Leave blank otherwise.
recode: sometimes, we want to recode variables for analyses (e.g. for categorical variables with many levels where sample sizes for some levels are too small to actually do anything with it). I use this column to note the kind of recoding I’ll do to a variable for transparency.
Here are additional columns that will make our lives easier or are applicable to some but not all data sets:
category: broad categories that different variables can be put into. I’m a fan of naming them things like “outcome”, “predictor”, “moderator”, “demographic”, “procedural”, etc. but sometimes use more descriptive labels like “Big 5” to indicate the model from which the measures are derived.
label: label is basically one level lower than category. So if the category is Big 5, the label would be, or example, “A” for Agreeableness, “SWB” for subjective well-being, etc. This column is most important and useful when you have multiple items in a scales, so I’ll typically leave this blank when something is a standalone variable (e.g. sex, single-item scales, etc.).
item_name: This is the lowest level and most descriptive variable. It indicates which item in scale something is. So it may be “kind” for Agreebleness or “sex” for the demographic biological sex variable.
year: for longitudinal data, we have several waves of data and the name of the same item across waves is often different, so it’s important to note to which wave an item belongs. You can do this by noting the wave (e.g. 1, 2, 3), but I prefer the actual year the data were collected (e.g. 2005, 2009, etc.) if that is appropriate. See Lecture #1 on discussion of meaningful time metrics. Note that this differs from that discussion in your codebook you want to describe how you collected the data, not necessarily how you want to analyze the data.
new_name: This is a column that brings together much of the information we’ve already collected. It’s purpose is to be the new name that we will give to the variable that is more useful and descriptive to us. This is a constructed variable that brings together others. I like to make it a combination of “category”, “label”, “item_name”, and year using varying combos of "_" and “.” that we can use later with tidyverse functions. I typically construct this variable in Excel using the CONCATENATE() function, but it could also be done in R. The reason I do it in Excel is that it makes it easier for someone who may be reviewing my codebook.
There is a separate discussion to be had on naming conventions for your variables, but the important idea to remember is that names convey important information and we want to use this information later on to make our life easier. By coding these variables using this information AND systematically using different separators we can accomplish this goal.
These are just suggestions, but after working with many longitudinal datasets I will say all of them are horrible in some way. Doing this makes them less horrible. Is it some upfront work? Yes. Will it ultimately save you time? Yes. Also, if you know this prior to runnign a study you are making some sort of code book anyways, right, right? Might as well kill two birds with one stone.
You can make the codebook anyway you want, but the two best options are miscrosoft excel and google pages. Not because they are necessarily the best functioning but because they are relatively ubiquitous and are easy to share.
We will create a codebook but then bring the codebook into R via turning it into a csv. You can rethink the codebook as a way of coding prior to putting anything into R.
Below, I’ll load in the codebook we will use for this study, which will include all of the above columns.
codebook <- read.csv("https://raw.githubusercontent.com/josh-jackson/longitudinal-2021/master/codebook.csv")
codebook <- codebook %>%
mutate(old_name = str_to_lower(old_name))
head(codebook)
dataset old_name item_text
1 persnr Never Changing Person ID
2 hhnr household ID
3 ppfad gebjahr Year of Birth
4 ppfad sex Sex
5 vp vp12501 Thorough Worker
6 zp zp12001 Thorough Worker
scale
1
2
3 numeric
4 \n1 [1] Male\n2 [2] Female\n-1 [-1] No Answer\n-2 [-2] Does not apply\n-3 [-3] Answer improbable\n-4 [-4] Inadmissible multiple response\n-5 [-5] Not included in this version of the questionnaire\n-6 [-6] Version of questionnaire with modified filtering\n56186 57630 24 0 0 0 0 0
5
6
category label item_name year new_name reverse
1 Procedural SID 0 Procedural__SID 1
2 Procedural household 0 Procedural__household 1
3 Demographic DOB 0 Demographic__DOB 1
4 Demographic Sex 0 Demographic__Sex 1
5 Big 5 C thorough 2005 Big 5__C_thorough.2005 1
6 Big 5 C thorough 2009 Big 5__C_thorough.2009 1
mini maxi recode
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 1 7
6 1 7
First, we need to load in the data. We’re going to use three waves of data from the German Socioeconomic Panel Study, which is a longitudinal study of German households that has been conducted since 1984. We’re going to use more recent data from three waves of personality data collected between 2005 and 2013.
Note: we will be using the teaching set of the GSOEP data set. I will not be pulling from the raw files as a result of this. I will also not be mirroring the format that you would usually load the GSOEP from because that is slightly more complicated and somethng we will return to in a later tutorial after we have more skills. I’ve left that code for now, but it won’t make a lot of sense right now.
This code below shows how I would read in and rename a wide-format data set using the codebook I created.
old.names <- codebook$old_name # get old column names
new.names <- codebook$new_name # get new column names
soep <- read.csv("https://raw.githubusercontent.com/josh-jackson/longitudinal-2021/master/soepdata.csv")
soep <- soep %>% # read in data
dplyr::select(old.names) %>% # select the columns from our codebook
setNames(new.names) # rename columns with our new names
paged_table(soep)